Some words to start with

Welcome to the page! This is an essential part of my final assignment for the course “Introduction to Open Data Science”, or “IODS” as we call it. My name is Laura Matkala and I am a PhD student who studies forests. I have to say this is one of the most inspiring courses I have taken in a while. I will do my best with all the new skills I have learned during the course to produce the best possible outcome for this assignment!

knitr::include_graphics('C:/HY-Data/MATKALA/GitHub/IODS-final/figures/49.jpg')
Happy forest scientist by a lake in a mountain forest at Mammoth Lakes, CA, USA.(This is here to remind us that although it doesn't look like it now, the sun actually does exist...)


About the dataset

I chose to use the dataset Boston, which includes data about housing in the suburbs of Boston, Massachusetts, USA. I will later perform linear regression and logistic regression on the variable “crim”, but first some basic information about the dataset.

knitr::include_graphics('C:/HY-Data/MATKALA/GitHub/IODS-final/figures/boston.jpg')
The dataset has variables related to housing in the suburbs of Boston, Massachusetts, USA. Picture from: http://amtrakdowneaster.com/stations/boston


I didn’t do any data wrangling for the part where I perform linear regression, but I did some for the logistic regression part that follows. You can find the R script file with all the data wrangling code here. The variables in the dataset are listed in the structure output below.

Analysis

I will look into the same hypotheses in both analyses.

The hypotheses are:

1. Index of accessibility to radial highways affects per capita crime rate.

2. Full-value property-tax rate does not affect per capita crime rate.

3. Proportion of non-retail business acres does not affect per capita crime rate.

Linear regression

Data exploration

To start the linear regression I need to read in the data and load the needed packages. I will also check the structure and dimensions of the data to verify that everything is in order after the data wrangling and saving the file as csv.

Boston<-read.csv(file = "C:/HY-Data/MATKALA/GitHub/IODS-final/Boston.csv", header = TRUE, sep=",")
library(GGally); library(ggplot2)
str(Boston)
## 'data.frame':    506 obs. of  15 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : int  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
##  $ crime  : Factor w/ 4 levels "high","low","med_high",..: 2 2 2 2 2 2 4 4 4 4 ...
dim(Boston)
## [1] 506  15

Everything seems to be ok with the data, and it looks the way I intended at this point. Let’s make a couple of plots to see what the data looks like.

ggpairs(Boston, lower = list(combo = wrap("facethist", bins = 20)))

Ok, we see that “rad”, “tax” and “indus” have the highest correlations with “crim”. I will thus use those three as explanatory variables for “crim” in a multiple linear regression model. The model in this case takes the form \(y = \alpha+\beta_1x_1+\beta_2x_2+\beta_3x_3+\epsilon\), where \(x_1\), \(x_2\) and \(x_3\) are rad, tax and indus, the \(\beta\)s are their coefficients, \(\alpha\) is the intercept and \(\epsilon\) is the error term/random noise.

Building the model

Let’s start modeling!

my_model <- lm(crim ~ rad + tax + indus, data = Boston)
summary(my_model)
## 
## Call:
## lm(formula = crim ~ rad + tax + indus, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.180  -1.457  -0.248   0.763  76.418 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.048463   1.142904  -2.667  0.00789 ** 
## rad          0.563131   0.084871   6.635 8.42e-11 ***
## tax          0.001631   0.005083   0.321  0.74840    
## indus        0.055531   0.064352   0.863  0.38859    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.72 on 502 degrees of freedom
## Multiple R-squared:  0.3932, Adjusted R-squared:  0.3896 
## F-statistic: 108.4 on 3 and 502 DF,  p-value: < 2.2e-16

In linear regression the aim is to minimize the residuals, which are the prediction errors of the model. The best model fit is the one that minimizes the sum of the squared residuals. The \(\beta\) estimates this model gives us are 0.56, 0.002 and 0.06 for rad, tax and indus, respectively, whereas the \(\alpha\) (intercept) is -3.0. The standard error for \(\alpha\) is 1.1 and for the \(\beta\)s 0.08, 0.005 and 0.06, respectively. Based on the p-values it seems that only “rad” is important for the model. I will thus leave “tax” and “indus” out and formulate another model.
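The least-squares idea can be illustrated with a tiny standalone sketch (made-up toy numbers, not the Boston data): the coefficients that lm() returns give a smaller residual sum of squares than any other line would.

```r
# Toy data (hypothetical, for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)                      # least-squares fit
rss <- sum(residuals(fit)^2)          # sum of squared residuals of the fit

# Compare against some other line, e.g. y = 2.2 * x:
rss_other <- sum((y - 2.2 * x)^2)

rss < rss_other                       # TRUE: the least-squares line wins
```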

my_model2<- lm(crim ~ rad, data = Boston)
summary(my_model2)
## 
## Call:
## lm(formula = crim ~ rad, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.164  -1.381  -0.141   0.660  76.433 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.28716    0.44348  -5.157 3.61e-07 ***
## rad          0.61791    0.03433  17.998  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.718 on 504 degrees of freedom
## Multiple R-squared:  0.3913, Adjusted R-squared:   0.39 
## F-statistic: 323.9 on 1 and 504 DF,  p-value: < 2.2e-16

Well, no dramatic changes compared to the previous model if we look at the residuals. The standard error for “rad” as well as its p-value got smaller, though, compared to model number 1. The same happened to the intercept, so in this sense the latter model is better than the first one. Anyway, I come to the same conclusion as I did when looking into this dataset in the exercise where we used LDA: a higher index of accessibility to radial highways means better possibilities to access the highway, and thus to escape after committing a crime. In other words, all my hypotheses are supported.

Logistic regression

The model and its odds ratio

It’s time to do another kind of analysis on the same data. I chose logistic regression. I will use the same explanatory variables as with linear regression, except that for logistic regression I am using the dependent variable “crime” instead of “crim”, since the dependent variable in logistic regression needs to be categorical. “crime” is a categorical version of “crim”. As I mentioned earlier, here you can see how I did the mutation of the dependent variable.
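Analogously to the linear model above, logistic regression models the log-odds of the outcome rather than the outcome itself: \(\log\frac{p}{1-p} = \alpha+\beta_1x_1+\beta_2x_2+\beta_3x_3\), where \(p\) is the probability of the modelled outcome and the \(x\)s are again rad, tax and indus. Exponentiating a \(\beta\) gives the odds ratio for a one-unit increase in that explanatory variable.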

Here is the model, its summary and coefficients.

m <- glm(crime ~ rad + tax + indus, data = Boston, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = crime ~ rad + tax + indus, family = "binomial", 
##     data = Boston)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.01639  -0.22661   0.02370   0.04368   2.48870  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 10.785353   3.590682   3.004 0.002667 ** 
## rad         -0.455414   0.132091  -3.448 0.000565 ***
## tax          0.002602   0.007998   0.325 0.744938    
## indus       -0.256292   0.176308  -1.454 0.146041    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 570.179  on 505  degrees of freedom
## Residual deviance:  59.836  on 502  degrees of freedom
## AIC: 67.836
## 
## Number of Fisher Scoring iterations: 10
coef(m)
##  (Intercept)          rad          tax        indus 
## 10.785353478 -0.455413629  0.002601907 -0.256291520

Since the explanatory variables are not categorical, all the necessary information is shown above. According to the summary table “rad” is once again the only explanatory variable that is necessary for the model. This supports hypotheses 2 and 3. Therefore, I will leave the other variables out, adjust the model to its final form, show its summary and present its odds ratios.

m2 <- glm(crime ~ rad, data = Boston, family ="binomial")
summary(m2)
## 
## Call:
## glm(formula = crime ~ rad, family = "binomial", data = Boston)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4429  -0.2225   0.0577   0.0731   2.4836  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  8.28246    1.29238   6.409 1.47e-10 ***
## rad         -0.47166    0.05816  -8.109 5.11e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 570.179  on 505  degrees of freedom
## Residual deviance:  62.752  on 504  degrees of freedom
## AIC: 66.752
## 
## Number of Fisher Scoring iterations: 8
library(tidyr)

OR <- coef(m2) %>% exp
CI<-confint(m2) %>% exp
cbind(OR, CI)
##                       OR       2.5 %       97.5 %
## (Intercept) 3953.9201755 573.4230913 1.548093e+05
## rad            0.6239669   0.5330939 6.840731e-01
knitr::include_graphics('C:/HY-Data/MATKALA/GitHub/IODS-final/figures/holy_moly.jpg')

What on earth is going on with the intercept odds?!? How come it is so high? (The intercept odds correspond to rad = 0, which lies far outside the range of the data, so the huge number is not very meaningful in itself.) If we look at the odds ratio of “rad”, it is approximately 0.62. This means that each one-unit increase in “rad” multiplies the odds of the modelled outcome by about 0.62, i.e. decreases them. But before I can say anything about my hypotheses I need to look into the predictive power of the model.
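As a quick sanity check (reusing the coefficient value printed in the summary above, so no dataset is needed), the odds ratio is simply the exponential of the coefficient:

```r
# Odds ratio = exp(coefficient); the slope for rad in summary(m2) was -0.47166
beta_rad <- -0.47166
or_rad <- exp(beta_rad)
round(or_rad, 3)  # 0.624: each one-unit increase in rad multiplies the odds by ~0.62
```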

The predictive power

library(dplyr)
probabilities <- predict(m2, type = "response")
Boston <- mutate(Boston, probability = probabilities)
Boston <- mutate(Boston, prediction = probability>0.5 )
select(Boston, rad, crime, probability, prediction) %>% tail(10)
##     rad    crime probability prediction
## 497   6 med_high   0.9957328       TRUE
## 498   6 med_high   0.9957328       TRUE
## 499   6  med_low   0.9957328       TRUE
## 500   6  med_low   0.9957328       TRUE
## 501   6  med_low   0.9957328       TRUE
## 502   1      low   0.9995948       TRUE
## 503   1      low   0.9995948       TRUE
## 504   1      low   0.9995948       TRUE
## 505   1  med_low   0.9995948       TRUE
## 506   1      low   0.9995948       TRUE
table(crime = Boston$crime, prediction = Boston$prediction)
##           prediction
## crime      FALSE TRUE
##   high       126    1
##   low          0  127
##   med_high     6  120
##   med_low      0  126

Well, the numbers look rather suspicious to me. When “rad” is high the prediction is TRUE for “med_high” but also for “med_low”, and the probabilities for both of these results are extremely high. The likely reason is that “crime” has four levels: glm() with family = "binomial" treats the first factor level (here “high”) as failure and all other levels as success, so the model is really predicting the probability that “crime” is not “high”, rather than the probability of high crime. Binary logistic regression is therefore not the right method for this four-level response. “rad” is probably the only one of the explanatory variables that affects “crime”, but plain logistic regression is not the way to analyse these data.
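This factor-level behaviour can be checked with a small standalone example (a hypothetical vector that mirrors the levels of “crime”; factor() orders levels alphabetically by default):

```r
# glm(family = "binomial") with a factor response treats the FIRST level as
# "failure" and every other level as "success". With default alphabetical
# ordering, "high" comes first:
y <- factor(c("high", "low", "med_high", "med_low"))
levels(y)[1]  # "high" -> such a model would predict P(y != "high")
```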

Finally

This has been a very helpful course for learning both R and statistical methods with different types of datasets. I also want to thank the organizers for making this a course where learning truly is fun. This is not the case in many courses, though it should be. So thank you for being so enthusiastic about everything yourselves; that sets a good example for the students. I have already recommended this course to a couple of my friends. Keep up the good work!